test: add real runAgentInSandbox() E2E tests for telegram injection by jyaunches · Pull Request #1267 · NVIDIA/NemoClaw

jyaunches · 2026-04-01T19:36:09Z

Summary

Add Phases 7-11 to test-telegram-injection.sh addressing PR #1092 feedback from @cv: exercise real production code paths instead of ad-hoc SSH commands.

Depends on #1266 (Brev E2E infra hardening) — merge that first.

Split from #1219 to keep test logic separate from infrastructure changes.

Changes

test/e2e/test-telegram-injection.sh

New test phases:

Phase 7: Real runAgentInSandbox() with $(command) and backtick injection payloads (T9-T11)
Phase 8: Multi-line message handling with embedded injection (T12-T13)
Phase 9: Session ID option injection prevention via sanitizeSessionId, including negative Telegram chat IDs (T14-T15)
Phase 10: Empty and special-character-only session IDs (T16-T17)
Phase 11: Cleanup of marker files and temp scripts

These tests are expected to FAIL on unfixed code (runAgentInSandbox is not exported, signature doesn't accept options), proving they detect the vulnerability. They will pass once PR #119 is merged.

Testing

shellcheck passes (via pre-commit)
npm test — CLI tests pass
Tests validated on Brev CPU instances in prior runs on test: add real runAgentInSandbox() E2E tests for telegram injection (Phases 7-11) #1219

Type of Change

Code change for a new feature, bug fix, or refactor.
Code change with doc updates.
Doc only.

Refs: #118, PR #1092

Summary by CodeRabbit

Release Notes

New Features
- Improved GPU detection with more robust validation during setup
- Added retry logic for Node.js and Docker installations with increased visibility into the process
Bug Fixes
- Fixed Docker runtime configuration on CPU-only systems to prevent GPU-related errors
- Resolved apt/dpkg lock contention issues during initial setup
Tests
- Expanded security testing coverage for environment variable injection prevention
- Added comprehensive edge-case validation for session ID handling

Shared E2E infrastructure improvements that make Brev-based CI reliable on CPU-only instances: brev-setup.sh: - Extract HAS_GPU flag for consistent GPU detection — nvidia-smi must run successfully, not just exist on PATH (Brev GPU images ship the binary even on CPU instances) - All GPU gates (container toolkit, Docker runtime reset, vLLM) use the single HAS_GPU flag - Replace cloud-init || true with proper error check + warning - Add timeout (300s) and quiet window (5s) to apt wait loop to prevent indefinite hangs and races - Add retry loops (5 attempts, 30s backoff) for Node.js and Docker apt-get installs - Unsuppress NodeSource installer output for debugging - Reset Docker default runtime to runc on CPU-only instances where Brev pre-configures nvidia as default setup.sh: - Check nvidia-smi exit code instead of just PATH presence for GPU gateway flag brev-e2e.test.js: - Use known GCP instance type (n2-standard-4) instead of cpu search - Add shellQuote() for safe secret interpolation in SSH commands - Delete leftover instances before creating new ones - Wait for cloud-init before running bootstrap

@cv

Add Phases 7-11 to test-telegram-injection.sh addressing PR NVIDIA#1092 feedback from @cv: exercise real production code paths instead of ad-hoc SSH commands. New test phases: - Phase 7: Real runAgentInSandbox() with $(command) and backtick injection payloads (T9-T11) - Phase 8: Multi-line message handling with embedded injection (T12-T13) - Phase 9: Session ID option injection prevention via sanitizeSessionId, including negative Telegram chat IDs (T14-T15) - Phase 10: Empty and special-character-only session IDs (T16-T17) - Phase 11: Cleanup of marker files and temp scripts These tests are expected to FAIL on unfixed code (runAgentInSandbox is not exported, signature doesn't accept options), proving they detect the vulnerability. They will pass once PR NVIDIA#119 is merged. Refs: NVIDIA#118, PR NVIDIA#1092

coderabbitai · 2026-04-01T19:36:29Z

📝 Walkthrough

Walkthrough

Scripts across provisioning, setup, and testing were updated. GPU detection now validates successful execution rather than command presence. Apt access is serialized with cloud-init synchronization, background service termination, process killing, lock clearing, and idle-loop verification. Node.js and Docker installations use retry logic with backoff. E2E tests now specify instance types directly and await cloud-init completion. Telegram injection tests expand coverage with sandboxing and session ID sanitization validation.

Changes

Cohort / File(s)	Summary
GPU Detection & Routing `scripts/brev-setup.sh`, `scripts/setup.sh`	Changed GPU availability checks from `command -v nvidia-smi` (PATH check) to direct `nvidia-smi` execution validation; routes GPU-dependent logic through unified `HAS_GPU` flag.
Apt Lock & Serialization `scripts/brev-setup.sh`	Introduced exclusive apt access phase: waits for cloud-init completion, stops/disables background apt timers, force-kills apt/dpkg processes, clears locks, configures dpkg, then loops until idle with max timeout enforcement.
Installation Retry Logic `scripts/brev-setup.sh`	Updated Node.js and Docker installations to retry up to 5 times with 30s backoff; removed output suppression to `/dev/null` for visibility.
Docker Runtime Configuration `scripts/brev-setup.sh`	On CPU-only systems, removes `"default-runtime": "nvidia"` from Docker daemon config when GPU is unavailable and restarts Docker to use default `runc` runtime.
E2E Provisioning `test/e2e/brev-e2e.test.js`	Replaced numeric CPU/RAM constraints with single instance type override (`BREV_INSTANCE_TYPE`); added pre-provisioning cleanup, changed environment variable quoting strategy, and inserted cloud-init synchronization before bootstrap.
E2E Injection Testing `test/e2e/test-telegram-injection.sh`	Expanded test coverage (Phases 7–10) with injection prevention validation via sandbox markers, process argument inspection, session ID sanitization testing, and edge case handling for empty/special-character-only session IDs.

Sequence Diagram(s)

sequenceDiagram
    actor Script as Brev Setup<br/>Script
    participant CloudInit as cloud-init
    participant SystemD as systemd<br/>(timers)
    participant AptDpkg as apt/dpkg<br/>Processes
    participant Locks as Lock<br/>Files
    participant DpkgConfig as dpkg<br/>--configure

    Script->>CloudInit: Wait for cloud-init<br/>status --wait
    CloudInit-->>Script: Ready
    
    Script->>SystemD: Stop/disable apt-daily*<br/>& unattended-upgrades
    SystemD-->>Script: Stopped
    
    Script->>AptDpkg: Force-kill apt-get,<br/>apt, dpkg processes
    AptDpkg-->>Script: Terminated
    
    Script->>Locks: Clear apt/dpkg<br/>lock files
    Locks-->>Script: Cleared
    
    Script->>DpkgConfig: Run dpkg --configure -a
    DpkgConfig-->>Script: Configuration complete
    
    loop Until Idle or Timeout
        Script->>AptDpkg: Check for<br/>running processes
        alt Idle Window Detected
            AptDpkg-->>Script: No activity ✓
            Script->>Script: Exit loop
        else Still Active
            AptDpkg-->>Script: Still running
            Script->>Script: Wait & retry
        end
    end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 With GPU flags now truly tested,
And apt locks soundly nested,
Retries bounce like hops so keen,
The cleanest setup ever seen!
Sessions safe from injection's play,
Terraform'd tests light the way! ✨

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 33.33% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The PR title describes the main addition of real runAgentInSandbox() E2E tests for Telegram injection, which aligns with the primary test file change (385 lines added to test-telegram-injection.sh). However, the PR also includes significant infrastructure hardening in brev-setup.sh and setup.sh that constitutes a substantial portion of the work and changes.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Warning

Review ran into problems

🔥 Problems

Timed out fetching pipeline failures after 30000ms

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

🧹 Nitpick comments (1)

test/e2e/test-telegram-injection.sh (1)
569-595: Consider running the backgrounded command in an explicit subshell for clarity.

The background job at lines 569-571 uses cd "$REPO" && ... which implicitly creates a subshell when backgrounded. While this works, an explicit subshell (cd "$REPO" && ...) would make the intent clearer and ensure consistent behavior across shell versions.

That said, the current implementation is functionally correct—kill $BRIDGE_PID will signal the job leader, and wait $BRIDGE_PID will reap it properly.
🔧 Optional: explicit subshell
-  cd "$REPO" && OPENSHELL_PATH="$OPENSHELL_PATH" \
+  (cd "$REPO" && OPENSHELL_PATH="$OPENSHELL_PATH" \
     NEMOCLAW_SANDBOX_NAME="$SANDBOX_NAME" \
-    timeout 60 node /tmp/test-real-bridge.js 'Hello world' 'e2e-ps-check' &
+    timeout 60 node /tmp/test-real-bridge.js 'Hello world' 'e2e-ps-check') &
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/test-telegram-injection.sh` around lines 569 - 595, The backgrounded
command that sets OPENSHELL_PATH/NEMOCLAW_SANDBOX_NAME and runs node
/tmp/test-real-bridge.js should be executed in an explicit subshell to make the
intent clear and ensure consistent behavior across shells; wrap the current `cd
"$REPO" && ...` invocation in a subshell (i.e., `(cd "$REPO" &&
OPENSHELL_PATH="..." NEMOCLAW_SANDBOX_NAME="..." timeout 60 node
/tmp/test-real-bridge.js 'Hello world' 'e2e-ps-check') &` so BRIDGE_PID
continues to be captured from `$!` and the environment/working-directory change
is isolated. Ensure BRIDGE_PID is still assigned immediately after backgrounding
and that the later kill $BRIDGE_PID and wait $BRIDGE_PID logic remains
unchanged.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@test/e2e/test-telegram-injection.sh`:
- Around line 569-595: The backgrounded command that sets
OPENSHELL_PATH/NEMOCLAW_SANDBOX_NAME and runs node /tmp/test-real-bridge.js
should be executed in an explicit subshell to make the intent clear and ensure
consistent behavior across shells; wrap the current `cd "$REPO" && ...`
invocation in a subshell (i.e., `(cd "$REPO" && OPENSHELL_PATH="..."
NEMOCLAW_SANDBOX_NAME="..." timeout 60 node /tmp/test-real-bridge.js 'Hello
world' 'e2e-ps-check') &` so BRIDGE_PID continues to be captured from `$!` and
the environment/working-directory change is isolated. Ensure BRIDGE_PID is still
assigned immediately after backgrounding and that the later kill $BRIDGE_PID and
wait $BRIDGE_PID logic remains unchanged.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 1a1d5163-1d85-4ee7-8e01-e6c247d15bea

📥 Commits

Reviewing files that changed from the base of the PR and between e231d32 and 77e0862.

📒 Files selected for processing (4)

scripts/brev-setup.sh
scripts/setup.sh
test/e2e/brev-e2e.test.js
test/e2e/test-telegram-injection.sh

wscurran · 2026-04-01T20:57:58Z

✨ Thanks for submitting this pull request, which proposes a way to add real runAgentInSandbox() E2E tests for Telegram injection. Possibly related open issues: #1107, #1114, #1109.

Possibly related open issues:

#1107 refactor(test): wire E2E security tests to real production code paths
#1114 Error: EACCES: permission denied - Models , Whatsapp & Telegram
#1109 Discord, Slack, and Telegram bot tokens bypass OpenShell provider system — injected as plaintext env vars into sandbox

jyaunches · 2026-04-02T18:32:15Z

Closing this PR — per @aerickson's direction, the Telegram bridge is going away entirely, making this injection test work redundant/no longer needed.

jyaunches added 2 commits April 1, 2026 15:33

coderabbitai bot reviewed Apr 1, 2026

View reviewed changes

wscurran added the Integration: Telegram Use this label to identify Telegram bot integration issues with NemoClaw. label Apr 1, 2026

jyaunches closed this Apr 2, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

test: add real runAgentInSandbox() E2E tests for telegram injection#1267

test: add real runAgentInSandbox() E2E tests for telegram injection#1267
jyaunches wants to merge 2 commits intoNVIDIA:mainfrom
jyaunches:test/telegram-injection-e2e

jyaunches commented Apr 1, 2026 •

edited by coderabbitai bot

Loading

Uh oh!

coderabbitai bot commented Apr 1, 2026 •

edited

Loading

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Poem

❌ Failed checks (1 warning)

Review ran into problems

Uh oh!

coderabbitai bot left a comment

Uh oh!

wscurran commented Apr 1, 2026

Uh oh!

jyaunches commented Apr 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

jyaunches commented Apr 1, 2026 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Changes

test/e2e/test-telegram-injection.sh

Testing

Type of Change

Summary by CodeRabbit

Release Notes

Uh oh!

coderabbitai bot commented Apr 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Poem

❌ Failed checks (1 warning)

Review ran into problems

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

wscurran commented Apr 1, 2026

Uh oh!

jyaunches commented Apr 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

jyaunches commented Apr 1, 2026 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Apr 1, 2026 •

edited

Loading